2023-12-07
tidymodels example with interaction, and a couple of secrets, hidden for now.
| response | n | missing | mean | sd |
|---|---|---|---|---|
| y | 142 | 0 | 47.83 | 26.94 |
| x | 142 | 0 | 54.27 | 16.77 |
| set | n | mean_x | sd_x | mean_y | sd_y | corr_xy |
|---|---|---|---|---|---|---|
| 1 | 142 | 54.27 | 16.77 | 47.83 | 26.94 | -0.06 |
| 2 | 142 | 54.27 | 16.77 | 47.83 | 26.94 | -0.07 |
| 3 | 142 | 54.27 | 16.76 | 47.84 | 26.93 | -0.07 |
| 4 | 142 | 54.26 | 16.77 | 47.83 | 26.94 | -0.06 |
| 5 | 142 | 54.26 | 16.77 | 47.84 | 26.93 | -0.06 |
| 6 | 142 | 54.26 | 16.77 | 47.83 | 26.94 | -0.06 |
| 7 | 142 | 54.27 | 16.77 | 47.84 | 26.94 | -0.07 |
| 8 | 142 | 54.27 | 16.77 | 47.84 | 26.94 | -0.07 |
| 9 | 142 | 54.27 | 16.77 | 47.83 | 26.94 | -0.07 |
| 10 | 142 | 54.27 | 16.77 | 47.84 | 26.93 | -0.06 |
| 11 | 142 | 54.27 | 16.77 | 47.84 | 26.94 | -0.07 |
| 12 | 142 | 54.27 | 16.77 | 47.83 | 26.94 | -0.07 |
| 13 | 142 | 54.26 | 16.77 | 47.84 | 26.93 | -0.07 |
set_1 <- lm(y ~ x, data = df |> filter(set == 1))
tidy(set_1, conf.int = TRUE, conf.level = 0.9) |>
  select(-statistic, -p.value) |> kable(digits = 2)

| term | estimate | std.error | conf.low | conf.high |
|---|---|---|---|---|
| (Intercept) | 53.43 | 7.69 | 40.69 | 66.16 |
| x | -0.10 | 0.14 | -0.33 | 0.12 |
| r.squared | adj.r.squared | sigma | BIC | p.value |
|---|---|---|---|---|
| 0.004 | -0.003 | 26.98 | 1351.64 | 0.448 |
| dataset | r.squared | adj.r.squared | sigma | AIC | BIC |
|---|---|---|---|---|---|
| 1 | 0.004 | -0.003 | 27 | 1343 | 1352 |
| 2 | 0.005 | -0.002 | 27 | 1343 | 1352 |
| 3 | 0.005 | -0.002 | 27 | 1343 | 1351 |
| 4 | 0.004 | -0.003 | 27 | 1343 | 1352 |
| 5 | 0.004 | -0.003 | 27 | 1343 | 1352 |
| 6 | 0.004 | -0.003 | 27 | 1343 | 1352 |
| 7 | 0.005 | -0.002 | 27 | 1343 | 1352 |
| 8 | 0.005 | -0.002 | 27 | 1343 | 1352 |
| 9 | 0.005 | -0.002 | 27 | 1343 | 1352 |
| 10 | 0.004 | -0.003 | 27 | 1343 | 1352 |
| 11 | 0.005 | -0.002 | 27 | 1343 | 1352 |
| 12 | 0.004 | -0.003 | 27 | 1343 | 1352 |
| 13 | 0.004 | -0.003 | 27 | 1343 | 1352 |
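A table of fit summaries across all 13 sets can be assembled by fitting the same model to each set and collecting glance() output. This sketch uses tidyr::nest() and purrr::map(), one of several reasonable approaches:

```r
library(tidyverse)
library(broom)

# Fit y ~ x separately within each set, then gather fit statistics
df |>
  nest(data = -set) |>
  mutate(fit = map(data, \(d) lm(y ~ x, data = d)),
         res = map(fit, glance)) |>
  unnest(res) |>
  select(dataset = set, r.squared, adj.r.squared, sigma, AIC, BIC)
```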
Does a linear model for y using x seem appropriate?
Call:
lm(formula = y ~ x, data = filter(df, set == 1))
Coefficients:
(Intercept) x
53.425 -0.103
Data sets 2-13 all look about the same in terms of means, medians, correlations, and regression models, but what happens if we plot the data?
These are, of course, the datasauRus dozen data sets described in Spiegelhalter, available in the datasauRus package, which you can install from CRAN thanks to the work of Steph Locke.
library(datasauRus)
df <- datasaurus_dozen
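A minimal way to see the punchline is to plot all 13 sets at once. In the datasauRus package itself, the grouping column is named dataset:

```r
library(ggplot2)
library(datasauRus)

# One scatterplot panel per data set: identical summary statistics,
# wildly different shapes (including the dinosaur)
ggplot(datasaurus_dozen, aes(x = x, y = y)) +
  geom_point(alpha = 0.5) +
  facet_wrap(~ dataset) +
  theme_minimal()
```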
The moral of the story: never trust summary statistics alone; always visualize your data.
This is Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing by Justin Matejka and George Fitzmaurice.
We’ll look at a couple of the animated plots they generate there.
Constable (1993) compared the inter-radial suture widths of urchins maintained on one of three food regimes
In an attempt to control for substantial variability in urchin sizes, the initial body volume of each urchin was measured as a covariate.
library(tidymodels)
library(readr)
library(broom.mixed)
urchins <-
# Data were assembled for a tutorial
# at https://www.flutterbys.com.au/stats/tut/tut7.5a.html
read_csv("https://tidymodels.org/start/models/urchins.csv") |>
# Change the names to be a little more verbose
setNames(c("food_regime", "initial_volume", "width")) |>
mutate(food_regime =
factor(food_regime,
levels = c("Initial", "Low", "High")))

For each of 72 sea urchins in the urchins data, we know their

- food_regime: the feeding regime (Initial, Low, or High),
- initial_volume: the initial body volume (in ml), and
- width: the suture width at the end of the experiment.

Since the slopes appear to be different for at least two of the feeding regimes, let's build a model that allows for two-way interactions. We'll use a linear model for width which allows each food regime to generate a different slope and intercept for the effect of initial volume.
lm(width ~ initial_volume * food_regime, data = urchins) |> tidy() |>
  select(term, estimate) |> kable(dig = 4) |> kable_styling(font_size = 24)

| term | estimate |
|---|---|
| (Intercept) | 0.0331 |
| initial_volume | 0.0016 |
| food_regimeLow | 0.0198 |
| food_regimeHigh | 0.0214 |
| initial_volume:food_regimeLow | -0.0013 |
| initial_volume:food_regimeHigh | 0.0005 |
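In an interaction model, the slope for initial_volume under each regime combines the main effect with that regime's interaction term. Assembled from the estimates above:

```r
# Slope for the effect of initial_volume under each food regime
slope_initial <- 0.0016                  # reference level: main effect only
slope_low     <- 0.0016 + (-0.0013)      # = 0.0003
slope_high    <- 0.0016 + 0.0005         # = 0.0021
```

So urchins in the High regime grow fastest in width per unit of initial volume, and those in the Low regime barely grow at all.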
Our tidymodels model specification prints as:

Linear Regression Model Specification (regression)
Computational engine: lm
It turns out that we’ll have several options for engines here.
We then train the model with fit(). We'll look at the results on the next slide.
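A sketch of the specification and fitting steps, following the tidymodels "Build a model" tutorial; the names lm_mod and lm_fit are assumptions consistent with the slides:

```r
library(tidymodels)

# Specify a linear regression model using the ordinary lm engine
lm_mod <- linear_reg() |>
  set_engine("lm")

# Fit the interaction model to the urchins data
lm_fit <- lm_mod |>
  fit(width ~ initial_volume * food_regime, data = urchins)
```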
tidy(lm_fit, conf.int = TRUE) |> select(term, estimate, conf.low, conf.high) |>
  kable(dig = 4) |> kable_styling(font_size = 28)

| term | estimate | conf.low | conf.high |
|---|---|---|---|
| (Intercept) | 0.0331 | 0.0139 | 0.0523 |
| initial_volume | 0.0016 | 0.0008 | 0.0023 |
| food_regimeLow | 0.0198 | -0.0061 | 0.0457 |
| food_regimeHigh | 0.0214 | -0.0076 | 0.0504 |
| initial_volume:food_regimeLow | -0.0013 | -0.0023 | -0.0002 |
| initial_volume:food_regimeHigh | 0.0005 | -0.0009 | 0.0019 |
Suppose that, for a publication, it would be particularly interesting to make a plot of the mean body size for urchins that started the experiment with an initial volume of 20ml. To create such a graph, we start with some new example data that we will make predictions for.
We'll develop mean predictions and uncertainty intervals for these new_points.
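One way to build new_points and the plot_data of predictions, a sketch assuming the fitted model object is named lm_fit:

```r
# New example data: initial volume of 20 ml under each feeding regime
new_points <- expand.grid(initial_volume = 20,
                          food_regime = c("Initial", "Low", "High"))

# Mean predictions (.pred) plus confidence limits (.pred_lower, .pred_upper)
plot_data <- new_points |>
  bind_cols(predict(lm_fit, new_data = new_points)) |>
  bind_cols(predict(lm_fit, new_data = new_points, type = "conf_int"))
```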
After plotting the plot_data results, we ask: would the results be different if we used a Bayesian approach?
The stan_glm() function can be used, and this is available as an engine in tidymodels, where we need to specify prior and prior_intercept to fit a linear model.

bayes_fit <- bayes_mod |>
fit(width ~ initial_volume * food_regime, data = urchins)
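The bayes_mod specification used above might look like this sketch, with t(1) priors matching the plot title (the rstanarm package must be installed for the "stan" engine):

```r
library(tidymodels)
library(rstanarm)

# Wide, Cauchy-like t(1) prior for both slopes and intercept
prior_dist <- student_t(df = 1)

set.seed(123)  # make the MCMC sampling reproducible

bayes_mod <- linear_reg() |>
  set_engine("stan",
             prior_intercept = prior_dist,
             prior = prior_dist)
```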
tidy(bayes_fit, conf.int = TRUE) |>
select(term, estimate, conf.low, conf.high) |>
kable(dig = 4) |> kable_styling(font_size = 24)| term | estimate | conf.low | conf.high |
|---|---|---|---|
| (Intercept) | 0.0330 | 0.0167 | 0.0492 |
| initial_volume | 0.0016 | 0.0009 | 0.0023 |
| food_regimeLow | 0.0198 | -0.0020 | 0.0415 |
| food_regimeHigh | 0.0216 | -0.0027 | 0.0460 |
| initial_volume:food_regimeLow | -0.0013 | -0.0021 | -0.0004 |
| initial_volume:food_regimeHigh | 0.0005 | -0.0007 | 0.0017 |
bayes_plot_data <-
new_points |>
bind_cols(predict(bayes_fit, new_data = new_points)) |>
bind_cols(predict(bayes_fit, new_data = new_points,
type = "conf_int"))
ggplot(bayes_plot_data, aes(x = food_regime, y = .pred)) +
geom_point() +
geom_errorbar(aes(ymin = .pred_lower, ymax = .pred_upper),
width = .2) +
labs(y = "urchin size",
title = "Bayesian model with t(1) prior distribution")

| initial_volume | food_regime | .pred | .pred_lower | .pred_upper |
|---|---|---|---|---|
| 20 | Initial | 0.0642 | 0.0555 | 0.0729 |
| 20 | Low | 0.0588 | 0.0499 | 0.0678 |
| 20 | High | 0.0961 | 0.0870 | 0.1053 |
| initial_volume | food_regime | .pred | .pred_lower | .pred_upper |
|---|---|---|---|---|
| 20 | Initial | 0.0642 | 0.0554 | 0.0729 |
| 20 | Low | 0.0588 | 0.0498 | 0.0677 |
| 20 | High | 0.0960 | 0.0869 | 0.1055 |
From PLoS Computational Biology
Applied statistics is hard.
To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of. (R. A. Fisher)
So you collected data and analyzed the results. Now you want to do an after-the-fact (post hoc) power analysis.
None of this is particularly smart.
Age_at_Diagnosis is a much, much better name than ADx.

Jeff Leek: "How to share data with a statistician"
R version 4.3.2 (2023-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 11 x64 (build 22621)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.utf8
[2] LC_CTYPE=English_United States.utf8
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.utf8
time zone: America/New_York
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] broom.mixed_0.2.9.4 yardstick_1.2.0 workflowsets_1.0.1
[4] workflows_1.1.3 tune_1.1.2 rsample_1.2.0
[7] recipes_1.0.8 parsnip_1.1.1 modeldata_1.2.0
[10] infer_1.0.5 dials_1.2.0 scales_1.3.0
[13] tidymodels_1.1.1 lubridate_1.9.3 forcats_1.0.0
[16] stringr_1.5.1 purrr_1.0.2 readr_2.1.4
[19] tidyr_1.3.0 tibble_3.2.1 tidyverse_2.0.0
[22] broom_1.0.5 patchwork_1.1.3 mosaic_1.9.0
[25] mosaicData_0.20.4 ggformula_0.12.0 dplyr_1.1.4
[28] Matrix_1.6-4 ggplot2_3.4.4 lattice_0.21-9
[31] janitor_2.2.0 kableExtra_1.3.4 knitr_1.45
[34] datasauRus_0.1.6
loaded via a namespace (and not attached):
[1] shinythemes_1.2.0 splines_4.3.2 later_1.3.1
[4] hardhat_1.3.0 xts_0.13.1 rpart_4.1.21
[7] lifecycle_1.0.4 StanHeaders_2.26.28 globals_0.16.2
[10] processx_3.8.2 vroom_1.6.4 MASS_7.3-60
[13] crosstalk_1.2.1 backports_1.4.1 magrittr_2.0.3
[16] rmarkdown_2.25 yaml_2.3.7 httpuv_1.6.12
[19] pkgbuild_1.4.2 minqa_1.2.6 abind_1.4-5
[22] rvest_1.0.3 tensorA_0.36.2 nnet_7.3-19
[25] ipred_0.9-14 labelled_2.12.0 lava_1.7.3
[28] inline_0.3.19 listenv_0.9.0 parallelly_1.36.0
[31] svglite_2.1.2 codetools_0.2-19 DT_0.30
[34] xml2_1.3.5 tidyselect_1.2.0 rstanarm_2.26.1
[37] bayesplot_1.10.0 farver_2.1.1 lme4_1.1-35.1
[40] matrixStats_1.1.0 stats4_4.3.2 base64enc_0.1-3
[43] webshot_0.5.5 jsonlite_1.8.7 ellipsis_0.3.2
[46] ggridges_0.5.4 survival_3.5-7 iterators_1.0.14
[49] systemfonts_1.0.5 foreach_1.5.2 tools_4.3.2
[52] Rcpp_1.0.11 glue_1.6.2 prodlim_2023.08.28
[55] gridExtra_2.3 xfun_0.41 mgcv_1.9-0
[58] distributional_0.3.2 loo_2.6.0 withr_2.5.2
[61] fastmap_1.1.1 boot_1.3-28.1 fansi_1.0.5
[64] shinyjs_2.1.0 callr_3.7.3 digest_0.6.33
[67] timechange_0.2.0 R6_2.5.1 mime_0.12
[70] colorspace_2.1-0 gtools_3.9.5 markdown_1.11
[73] threejs_0.3.3 utf8_1.2.4 generics_0.1.3
[76] data.table_1.14.8 class_7.3-22 prettyunits_1.2.0
[79] httr_1.4.7 htmlwidgets_1.6.3 pkgconfig_2.0.3
[82] dygraphs_1.1.1.6 gtable_0.3.4 timeDate_4022.108
[85] GPfit_1.0-8 furrr_0.3.1 htmltools_0.5.7
[88] posterior_1.5.0 gower_1.0.1 snakecase_0.11.1
[91] rstudioapi_0.15.0 tzdb_0.4.0 reshape2_1.4.4
[94] checkmate_2.3.0 nloptr_2.0.3 nlme_3.1-163
[97] curl_5.1.0 zoo_1.8-12 parallel_4.3.2
[100] miniUI_0.1.1.1 pillar_1.9.0 grid_4.3.2
[103] vctrs_0.6.5 shinystan_2.6.0 promises_1.2.1
[106] xtable_1.8-4 lhs_1.1.6 evaluate_0.23
[109] cli_3.6.1 compiler_4.3.2 rlang_1.1.2
[112] crayon_1.5.2 rstantools_2.3.1.1 future.apply_1.11.0
[115] labeling_0.4.3 ps_1.7.5 plyr_1.8.9
[118] stringi_1.8.2 rstan_2.32.3 viridisLite_0.4.2
[121] QuickJSR_1.0.8 munsell_0.5.0 colourpicker_1.3.0
[124] mosaicCore_0.9.4.0 V8_4.4.0 hms_1.1.3
[127] bit64_4.0.5 future_1.33.0 shiny_1.8.0
[130] highr_0.10 haven_2.5.4 igraph_1.5.1
[133] RcppParallel_5.1.7 bit_4.0.5 DiceDesign_1.9
431 Class 26 | 2023-12-07 | https://thomaselove.github.io/431-2023/